@lichuang
Contributor

Closes #5172. Adds support for Utf8View and BinaryView, and tests both types in test_query_string.

@github-actions github-actions bot added the enhancement New feature or request label Jan 11, 2026
@lichuang
Contributor Author

lichuang commented Jan 12, 2026

It seems that the compilation error (https://github.com/lance-format/lance/actions/runs/20890689101/job/60043036449?pr=5685) is unrelated to the changes made in this PR. @wjones127

Contributor

@wjones127 wjones127 left a comment

I think we should be able to accomplish this without having to convert in so many places.

Also, could you validate FTS indices build correctly? I think one function you'll need to update is here:

https://github.com/lance-format/lance/blob/5513248362b25daa1c381ef4bfb0da0939a8f4d0/rust/lance-arrow/src/lib.rs#L458-L464

@lichuang lichuang force-pushed the support-utf8-view branch 2 times, most recently from 73f8249 to 29c016b Compare January 14, 2026 14:49
@lichuang
Copy link
Contributor Author

@wjones127 Please review this PR again.

Contributor

@wjones127 wjones127 left a comment

Changes looking good so far.

Note that before we merge this change, we'll need to change the spec to add these new data types first. So the next steps (which I've started) are:

  • Post a discussion about the changes #5817
  • Vote on the spec changes (1 week voting period)
  • If the vote passes, merge the spec changes
  • Merge this PR

Comment on lines 587 to 590
    // Check if we need to convert offsets from i32 to i64 (for LargeUtf8/LargeBinary)
    let offsets_buffer = if self.bits_per_offset == 32
        && (data_type == DataType::LargeUtf8 || data_type == DataType::LargeBinary)
    {
Contributor

Why is this necessary?

Contributor Author

@wjones127

The offsets conversion logic at lines 587-590 is necessary for backward compatibility. It handles the case when reading legacy data that was stored with 32-bit offsets but is now being read as LargeUtf8/LargeBinary (which use 64-bit offsets).

This is not introduced by this PR - it's existing code to handle historical data formats. I've added more detailed comments to clarify this purpose.
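
For illustration, the widening step can be sketched roughly as follows. This is a minimal, standalone sketch and not the actual Lance code: `widen_offsets` is a hypothetical helper name, and it assumes the legacy offsets are already available as a slice of `i32` rather than a raw buffer.

```rust
/// Widen 32-bit offsets to 64-bit offsets (hypothetical helper, not Lance's API).
/// Arrow offsets are non-negative and monotonically non-decreasing, so each
/// value converts losslessly with `i64::from`.
fn widen_offsets(offsets: &[i32]) -> Vec<i64> {
    offsets.iter().map(|&o| i64::from(o)).collect()
}
```

For example, the legacy offsets `[0, 2, 2, 5]` (the values "ab", "", "cde" stored back to back) would widen to the same four numbers as `i64`, which is the shape LargeUtf8/LargeBinary expect.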

For Utf8View and BinaryView types in this PR, this conversion is not triggered because we store them with 64-bit offsets (as implemented in arrow_view_to_data_block).
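
As a rough sketch of what storing view values with 64-bit offsets could look like: the helper below is hypothetical and is not the PR's `arrow_view_to_data_block`. It assumes the view array's values have already been materialized as a slice of `&str`, and it simply packs them into one contiguous byte buffer with `i64` offsets.

```rust
/// Flatten string values into one contiguous byte buffer plus 64-bit offsets.
/// `values` stands in for the elements of a Utf8View array (an assumption made
/// for this sketch); the name and signature are hypothetical.
fn flatten_with_i64_offsets(values: &[&str]) -> (Vec<u8>, Vec<i64>) {
    let mut data = Vec::new();
    // One more offset than values: offsets[i]..offsets[i + 1] delimits value i.
    let mut offsets = Vec::with_capacity(values.len() + 1);
    offsets.push(0i64);
    for v in values {
        data.extend_from_slice(v.as_bytes());
        offsets.push(data.len() as i64);
    }
    (data, offsets)
}
```

Because the offsets are produced as `i64` from the start, data written this way would not take the 32-bit-to-64-bit conversion path discussed above.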

Development

Successfully merging this pull request may close these issues.

Add support for StringView and BinaryView

2 participants